4/6/2021

Los Alamos National Laboratory

Research Themes

  • Computational Bayesian methods.

  • Spatio-temporal statistics.

  • Science-motivated statistical modeling.

Overview

  1. Spatio-temporal reconstruction of climate from pollen.
    1. Non-linear functional relationship.
      • Computationally challenging.
      • Non-linear and non-Gaussian relationships among data.
    2. Using data/parameter augmentation to improve computation.
  2. Recursive Bayesian inference for latent spatial data.
    • Reducing computational cost through parallelization.
    • Better Markov Chain Monte Carlo (MCMC) mixing.

Spatio-temporal reconstruction of climate from pollen

Pollen is a unique paleoclimate proxy

Modeling goal

  • Link the observed pollen counts to climate states during the modern period.

  • Use the learned relationship to predict unobserved climate state.

  • Generate climate histories that are local to the site of interest with uncertainty.

Tipton JR, Hooten MB, Nolan C, Booth RK, McLachlan J (2019). “Predicting paleoclimate from compositional data using multivariate Gaussian process inverse prediction.” The Annals of Applied Statistics, 13(4), 2363-2388.

Setting the stage

  • Spatially explicit reconstructions of climate variables are important.


    • Many important ecological questions are local.

    • Predictions at all locations remove the effects of sampling bias in paleoclimate reconstruction.

Setting the stage

  • Prior work – 4 pollen cores and total compute time of approximately 28 hours.


  • Currently: 363 sites:

    • non-linear response: compute time \(\approx\) 48 hours.
    • Pólya-gamma linear response: compute time less than 2 hours.

Holmström L, Ilvonen L, Seppä H, Veski S (2015). “A Bayesian spatiotemporal model for reconstructing climate from multiple pollen records.” The Annals of Applied Statistics, 9(3), 1194-1225.

Non-linear response model

Data model

  • Sediment samples from a lake.

  • Take 1cm\(^3\) cubes along the length of the sediment core.

  • In each cube, the researcher counts the first \(M\) pollen grains and identifies each to species/OTU.

  • Raw data are counts of each species.

  • Compositional count data.

Data Model

For the \(i\)th observation at location \(\mathbf{s}\) and time \(t\),


\[ \begin{align*} \mathbf{y}_i \left( \mathbf{s}, t \right) & = \left( y_{i1} \left( \mathbf{s}, t \right), \ldots, y_{id} \left( \mathbf{s}, t \right) \right)' \end{align*} \]


is a \(d\)-dimensional compositional count observation.


  • \(y_{ij} \left( \mathbf{s}, t \right)\) is the count of species \(j\) in the \(i\)th sample at location \(\mathbf{s}\) and time \(t\).

Data Model

\[ \begin{align*} \mathbf{y}_i \left( \mathbf{s}, t \right) | \mathbf{p}\left( \mathbf{s}, t \right) & \sim \operatorname{Multinomial} \left( M_i \left( \mathbf{s}, t \right), \mathbf{p}\left( \mathbf{s}, t \right) \right). \end{align*} \]


  • \(M_i \left( \mathbf{s}, t \right) = \sum_{j=1}^d y_{ij}\left( \mathbf{s}, t \right)\) is the total count observed (fixed and known) for observation \(i\) at location \(\mathbf{s}\) and time \(t\).

  • The observation is informative only about the relative proportions \(p_{j} \left( \mathbf{s}, t \right)\).

Data Model: Overdispersion

  • The pollen data are highly variable and overdispersed.


  • Mixture over a Dirichlet distribution.


\[ \begin{align*} \mathbf{p}\left( \mathbf{s}, t \right) | \boldsymbol{\alpha}\left( \mathbf{s}, t \right) & \sim \operatorname{Dirichlet} \left( \boldsymbol{\alpha}\left( \mathbf{s}, t \right) \right). \end{align*} \]


  • Marginalize out \(\mathbf{p} \left( \mathbf{s}, t \right)\).

\[ \begin{align*} \mathbf{y}_i\left( \mathbf{s}, t \right) | \boldsymbol{\alpha}\left( \mathbf{s}, t \right) & \sim \operatorname{Dirichlet-Multinomial} \left( M_i\left( \mathbf{s}, t \right), \boldsymbol{\alpha}\left( \mathbf{s}, t \right) \right). \end{align*} \]
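The overdispersion induced by marginalizing over the Dirichlet can be checked by simulation. A minimal Python sketch of the compound sampler (the total count and concentration vector are invented for illustration):

```python
import numpy as np

def rdirichlet_multinomial(rng, M, alpha, size=1):
    """Sample compositional counts y ~ Dirichlet-Multinomial(M, alpha)
    by compounding: p ~ Dirichlet(alpha), then y | p ~ Multinomial(M, p)."""
    p = rng.dirichlet(alpha, size=size)
    return np.array([rng.multinomial(M, pi) for pi in p])

rng = np.random.default_rng(42)
alpha = np.array([2.0, 5.0, 1.0])   # hypothetical concentration for d = 3 taxa
y = rdirichlet_multinomial(rng, M=400, alpha=alpha, size=2000)

# Each draw is a valid compositional count: rows sum to the total M.
assert (y.sum(axis=1) == 400).all()

# Overdispersion: empirical DM variances exceed the multinomial variances
# M * p_j * (1 - p_j) evaluated at the mean proportions p = alpha / sum(alpha).
p_bar = alpha / alpha.sum()
var_multinom = 400 * p_bar * (1 - p_bar)
assert (y.var(axis=0) > var_multinom).all()
```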


Data Model: Overdispersion

  • Model the Dirichlet-multinomial random effect using the log link function:

\[ \begin{align*} \operatorname{log} \left( \boldsymbol{\alpha} \left( \mathbf{s}, t \right) \right) & = f \left( \mathbf{z}\left( \mathbf{s}, t \right) \right) \boldsymbol{\beta}. \end{align*} \]


  • \(\mathbf{z}\left( \mathbf{s}, t \right)\) is a \(q\)-dimensional vector of climate variables.

  • \(f(\cdot)\) is some function of the climate state.

  • \(\boldsymbol{\beta}\) is a \(q \times d\) dimensional matrix of regression coefficients.


Response Function

\[ \begin{align*} \operatorname{log} \left( \boldsymbol{\alpha} \left( \mathbf{s}, t \right) \right) & = f \left( \mathbf{z}\left( \mathbf{s}, t \right) \right) \boldsymbol{\beta}. \end{align*} \]


  • \(f \left( \mathbf{z}\left( \mathbf{s}, t \right) \right)\) is a basis expansion of the covariates \(\mathbf{z}\left( \mathbf{s}, t \right)\).
    • Use B-splines or Gaussian Processes as a basis.
    • \(\mathbf{z}\left( \mathbf{s}, t \right)\) is unknown for \(t \neq 1\).
    • Computationally challenging.

Tipton JR, Hooten MB, Nolan C, Booth RK, McLachlan J (2019). “Predicting paleoclimate from compositional data using multivariate Gaussian process inverse prediction.” The Annals of Applied Statistics, 13(4), 2363-2388.

Hefley TJ, Broms KM, Brost BM, Buderman FE, Kay SL, Scharf HR, Tipton JR, Williams PJ, Hooten MB (2017). “The basis function approach for modeling autocorrelation in ecological data.” Ecology, 98(3), 632-646.
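As an illustration of the basis-expansion idea, the Python sketch below builds a Gaussian-kernel (radial) basis as a simple stand-in for the B-spline or Gaussian process bases used in the paper; the covariate values, knots, bandwidth, and coefficient dimensions are all invented:

```python
import numpy as np

def rbf_basis(z, centers, bandwidth):
    """Gaussian radial-basis expansion f(z): column k is the basis function
    exp(-(z - c_k)^2 / (2 * bandwidth^2)) evaluated at each input."""
    return np.exp(-(z[:, None] - centers[None, :]) ** 2 / (2 * bandwidth ** 2))

# Hypothetical climate covariate (e.g., July temperature) at n = 50 sites.
z = np.linspace(-2, 2, 50)
centers = np.linspace(-2, 2, 7)       # q = 7 basis knots
F = rbf_basis(z, centers, bandwidth=0.8)              # n x q design matrix

beta = np.random.default_rng(0).normal(size=(7, 4))   # q x d coefficients
log_alpha = F @ beta                                  # log alpha(s, t), n x d
assert F.shape == (50, 7)
assert log_alpha.shape == (50, 4)
```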

Calibration

  • \(\mathbf{z} \left( \mathbf{s}, t \right)\)s are observed only at \(t\) = 1.


  • Calibration: Estimate \(\boldsymbol{\beta}\) using: \[ \begin{align*} \mathbf{y} \left(1\right) & = \left( \mathbf{y} \left( \mathbf{s}_1, 1 \right), \ldots, \mathbf{y} \left( \mathbf{s}_n, 1 \right) \right)' \\ \mathbf{z} \left(1\right) & = \left( \mathbf{z} \left( \mathbf{s}_1, 1 \right), \ldots, \mathbf{z} \left( \mathbf{s}_n, 1 \right) \right)'. \end{align*} \]


  • Reconstruction:
    • Use estimated \(\boldsymbol{\beta}\)s and fossil pollen \(\mathbf{y} \left( t \right)\) to predict unobserved \(\mathbf{z}\left( t \right)\).


Dynamic Model

  • For \(\mathbf{z} \left(t \right) = \left( \mathbf{z} \left(\mathbf{s}_1, t \right)', \ldots, \mathbf{z} \left(\mathbf{s}_n, t \right)' \right)'\), we assume:

\[ \begin{align*} \mathbf{z} \left(t \right) - \mathbf{X} \left( t \right) \boldsymbol{\gamma} & = \mathbf{M}\left(t\right) \left( \mathbf{z} \left(t-1 \right) - \mathbf{X} \left( t \right) \boldsymbol{\gamma} \right) + \boldsymbol{\eta} \left(t \right). \end{align*} \]


  • \(\mathbf{M}(t) = \rho \mathbf{I}_n\) is a propagator matrix.

  • \(\mathbf{X} \left(t \right) \boldsymbol{\gamma}\) are the fixed effects from covariates like latitude, elevation, etc.

  • \(\boldsymbol{\eta} \left( t \right) \sim \operatorname{N} \left( \mathbf{0}, \mathbf{R}\left( \boldsymbol{\theta} \right) \right)\).

  • \(\mathbf{R} \left( \boldsymbol{\theta} \right)\) is a Matérn spatial covariance matrix with parameters \(\boldsymbol{\theta}\).
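The latent climate dynamics can be simulated directly from this specification. A Python sketch, using an exponential covariance (the Matérn with smoothness 1/2) and invented site coordinates, \(\rho\), and fixed effects:

```python
import numpy as np

rng = np.random.default_rng(1)
n, T, rho = 30, 20, 0.8
s = rng.uniform(0, 10, size=(n, 2))                 # invented site coordinates
D = np.linalg.norm(s[:, None] - s[None, :], axis=-1)
R = np.exp(-D / 2.0)     # exponential covariance (Matern, smoothness 1/2)
L = np.linalg.cholesky(R + 1e-10 * np.eye(n))       # jitter for stability
mu = rng.normal(size=n)                             # stands in for X(t) gamma

z = np.empty((T, n))
z[0] = mu + L @ rng.normal(size=n)
for t in range(1, T):
    # z(t) - X gamma = rho * (z(t-1) - X gamma) + eta(t),  eta(t) ~ N(0, R)
    z[t] = mu + rho * (z[t - 1] - mu) + L @ rng.normal(size=n)

assert z.shape == (T, n)
assert np.isfinite(z).all()
```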

Elevation

Scaling the process for big data

  • Define a set of spatial knot locations \(\mathbf{s}^{\star} = \left\{ \mathbf{s}_1^{\star}, \ldots, \mathbf{s}_m^{\star} \right\}\).

\[ \boldsymbol{\eta}^{\star} \left( t \right) \sim \operatorname{N} \left( \mathbf{0}, \mathbf{R}^{\star}\left( \boldsymbol{\theta} \right) \right). \]


  • \(\mathbf{R}^{\star}\left( \boldsymbol{\theta} \right)\) is the spatial covariance defined at the knot locations \(\mathbf{s}^{\star}\).

  • Linear interpolator from the knot locations \(\mathbf{s}^{\star}\) to the observed locations \(\mathbf{s}\) is \[ \mathbf{r} \left(\mathbf{s}, \mathbf{s}_j^{\star} \right) \mathbf{R}^{\star}\left( \boldsymbol{\theta} \right)^{-1}, \] where \(\mathbf{r} \left(\mathbf{s}, \mathbf{s}_j^{\star} \right) = \operatorname{Cov} \left(\mathbf{s}, \mathbf{s}_j^{\star} \right)\).

Banerjee S, Gelfand AE, Finley AO, Sang H (2008). “Gaussian predictive process models for large spatial data sets.” Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70(4), 825-848.

Predictive Process

  • \(\boldsymbol{\eta} \left( t \right) \approx \mathbf{r} \left(\mathbf{s}, \mathbf{s}^{\star} \right) \mathbf{R}^{\star}\left( \boldsymbol{\theta} \right)^{-1} \tilde{\boldsymbol{\eta}} \left( t \right)\).

  • The \(m\)-dimensional predictive process can be shown to be the optimal predictor of the parent process \(\boldsymbol{\eta} \left( t \right)\).

  • The dynamic climate process is approximated by

\[ \begin{align*} \mathbf{z} \left(t \right) - \mathbf{X} \left( t \right) \boldsymbol{\gamma} & = \mathbf{M}\left(t\right) \left( \mathbf{z} \left(t-1 \right) - \mathbf{X} \left( t \right) \boldsymbol{\gamma} \right) + \mathbf{r} \left(\mathbf{s}, \mathbf{s}^{\star} \right) \mathbf{R}^{\star}\left( \boldsymbol{\theta} \right)^{-1} \boldsymbol{\tilde{\eta}} \left(t \right). \end{align*} \]
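A minimal sketch of the predictive-process construction, assuming a one-dimensional domain and an exponential covariance as a simple stand-in for the Matérn family; the locations and range parameter are invented:

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 200, 25
s = rng.uniform(0, 10, size=(n, 1))              # observed locations
s_star = np.linspace(0, 10, m)[:, None]          # knot locations s*

def cov(a, b, range_par=2.0):
    # Exponential covariance between two sets of 1-d locations.
    return np.exp(-np.abs(a - b.T) / range_par)

R_star = cov(s_star, s_star)                     # m x m covariance at knots
r = cov(s, s_star)                               # n x m cross-covariance

# Predictive process: eta(t) ~ r(s, s*) R*^{-1} eta_star(t), with
# eta_star ~ N(0, R*) living only on the m knots.
R_j = R_star + 1e-10 * np.eye(m)                 # jitter for stability
eta_star = np.linalg.cholesky(R_j) @ rng.normal(size=m)
eta = r @ np.linalg.solve(R_j, eta_star)
assert eta.shape == (n,)
```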

Implementation

GitHub package


  • Includes options for multiple models including:
    • mixture models.
    • different likelihoods and link functions.
    • correlations in functional response.


  • Leverages code in C++ using Rcpp package for computation speed.

Computational Details

  1. Fit the calibration model.

  2. Use the posterior distribution from stage (1) to generate climate predictions that are independent across space and time.

  3. Smooth the posterior distribution from stage (2) using a dynamic linear model.

    • Uses posterior mean estimates, which do not fully quantify model uncertainty.

    • Goal: Use the recursive Bayesian ideas from the end of this talk.

  4. Fully Bayesian joint estimation.


Hooten MB, Johnson DS, Brost BM (2019). “Making Recursive Bayesian Inference Accessible.” The American Statistician, 1-10.

Estimation of \(\mathbf{z} \left(\mathbf{s}, t \right)\)

  • High-dimensional spatio-temporal process.
    • Inefficient to sample with block Metropolis.
    • Poor mixing of MCMC chains.


  • Non-linear transformation in the data model.
    • Difficult to use Kalman Filtering.


  • Particle Filtering Methods.
    • Difficult to implement, suffer from degeneracy.
    • Posterior can be multi-modal.

Estimation of \(\mathbf{z} \left(\mathbf{s}, t \right)\)

  • Elliptical Slice Sampling.
    • Assumes a Gaussian prior.
    • Difficult to implement in a DLM framework.
    • Requires no tuning.
    • Efficiently samples in high dimensions.
    • Easily explores multiple modes.

  • Adaptive block Metropolis within Gibbs and Elliptical Slice Sampling algorithms.

  • Highly multi-modal posterior is efficiently explored within the sampler.

Murray I, Adams RP, MacKay DJC (2010). “Elliptical slice sampling.” In AISTATS, volume 13, 541-548.
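Elliptical slice sampling is short enough to state in full. The sketch below follows Murray, Adams & MacKay (2010); the Gaussian toy likelihood is invented so that the true posterior mean (0.5 in each coordinate) is known:

```python
import numpy as np

def elliptical_slice(f, chol_prior, log_lik, rng):
    """One elliptical slice sampling update for a latent vector f with
    prior N(0, Sigma), where Sigma = L L' and L = chol_prior. No tuning."""
    nu = chol_prior @ rng.normal(size=f.shape)    # auxiliary draw from prior
    log_y = log_lik(f) + np.log(rng.uniform())    # slice height
    theta = rng.uniform(0.0, 2.0 * np.pi)
    theta_min, theta_max = theta - 2.0 * np.pi, theta
    while True:
        f_new = f * np.cos(theta) + nu * np.sin(theta)
        if log_lik(f_new) > log_y:                # proposal is on the slice
            return f_new
        if theta < 0.0:                           # shrink the bracket
            theta_min = theta
        else:
            theta_max = theta
        theta = rng.uniform(theta_min, theta_max)

# Toy check: N(0, I) prior with a Gaussian likelihood centered at 1 gives
# a posterior with mean 0.5 per coordinate.
rng = np.random.default_rng(3)
chol = np.eye(5)
log_lik = lambda f: -0.5 * np.sum((f - 1.0) ** 2)
f, samples = np.zeros(5), []
for i in range(600):
    f = elliptical_slice(f, chol, log_lik, rng)
    if i >= 100:                                  # discard burn-in
        samples.append(f.copy())
post_mean = np.mean(samples)
assert abs(post_mean - 0.5) < 0.25
```

Note that, as the slides state, the update has no tuning parameters: the bracket-shrinking loop always terminates because the current state lies on the slice.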

Estimation of \(\mathbf{z} \left(\mathbf{s}, t \right)\)

  • For Gaussian process expansion of \(\mathbf{z} \left( \mathbf{s}, t \right)\):
    • The latent climate states are inputs into the covariance function.
    • Covariance input locations (and distance) are random.
    • Unique computational challenge.
      • Total computational cost is prohibitive \(O(d \frac{(Tn)^3}{3})\).
  • Proposed solution:
    • Predictive process representation.
    • Reduced computation cost \(O(d T \frac{m^3}{3})\).

Tipton JR, Hooten MB, Nolan C, Booth RK, McLachlan J (2019). “Predicting paleoclimate from compositional data using multivariate Gaussian process inverse prediction.” The Annals of Applied Statistics, 13(4), 2363-2388.

Reconstruction

Reconstruction over time

Linear Pólya-gamma model

Data model

\[\begin{align*} \mathbf{y}(\mathbf{s}, t) & \sim \operatorname{Multinomial}(M(\mathbf{s}, t), \pi_{SB}(\boldsymbol{\eta}(\mathbf{s}, t))). \end{align*}\]


  • \(\mathbf{y}(\mathbf{s}, t)\) is a \(J\) dimensional vector of counts at site \(\mathbf{s} \in \mathcal{D}\) and time \(t \in \{1, \ldots, n_t\}\).

  • \(M(\mathbf{s}, t) = \sum_{j=1}^J y_j(\mathbf{s}, t)\) is the total count observed.

  • Link the underlying climate states to the probability of observing species \(j\) through the latent variable \(\eta_j(\mathbf{s}, t)\).

Data model

  • \(\pi_{SB}(\boldsymbol{\eta}(\mathbf{s}, t))\) is a stick breaking transformation.


    • The map \(\pi_{SB}: \mathbb{R}^{J-1} \rightarrow \Delta^{J}\) transforms the \(J-1\) dimensional vector \(\boldsymbol{\eta}(\mathbf{s}, t)\) to the \(J\) dimensional unit simplex \(\Delta^{J}\).

  • Other maps to the unit simplex could be used (e.g., the multinomial logit), but the stick-breaking map reduces computational cost.

Data model (ignoring spatio-temporal indexing)

\[\begin{align*} [\mathbf{y} | M, \pi_{SB}(\boldsymbol{\eta})] & =\operatorname{Multinomial}(\mathbf{y} | M, \pi_{SB}(\boldsymbol{\eta})) \\ & = \prod_{j=1}^{J-1} \operatorname{binomial}(y_j | M_j, \tilde{\pi}_j) \\ & = \prod_{j=1}^{J-1} {M_j \choose y_j} \frac{(e^{\eta_j})^{y_j}}{(1 + e^{\eta_j})^{M_j} }. \end{align*}\]

  • \(M_j = M - \sum_{k < j} y_k\).

  • Define the partial probabilities \(\tilde{\pi}_j = \pi_{SB}(\boldsymbol{\eta})_j\) using the stick-breaking representation.
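The stick-breaking factorization can be verified numerically. A sketch (assuming scipy is available for the pmfs; the \(\boldsymbol{\eta}\) and \(\mathbf{y}\) values are invented) checks that the multinomial likelihood equals the product of binomials with \(M_j = M - \sum_{k<j} y_k\):

```python
import numpy as np
from scipy.stats import multinomial, binom

def pi_sb(eta):
    """Stick-breaking map from R^{J-1} to the J-dimensional unit simplex."""
    pi_tilde = 1.0 / (1.0 + np.exp(-eta))   # logistic partial probabilities
    stick, pi = 1.0, []
    for pt in pi_tilde:
        pi.append(stick * pt)               # take a piece of the stick
        stick *= (1.0 - pt)                 # remaining stick length
    pi.append(stick)                        # final category gets the rest
    return np.array(pi)

eta = np.array([0.3, -0.8, 1.1])            # J - 1 = 3 latent variables
pi = pi_sb(eta)
assert np.isclose(pi.sum(), 1.0)

# Multinomial pmf factors into J - 1 binomials: M_j = M - sum_{k<j} y_k.
y = np.array([5, 2, 7, 6]); M = y.sum()
lhs = multinomial.pmf(y, n=M, p=pi)
pi_tilde = 1.0 / (1.0 + np.exp(-eta))
rhs, M_j = 1.0, M
for j in range(len(eta)):
    rhs *= binom.pmf(y[j], M_j, pi_tilde[j])
    M_j -= y[j]
assert np.isclose(lhs, rhs)
```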

Data model (ignoring spatio-temporal indexing)

\[\begin{align*} \frac{(e^{\eta_j})^{y_j}}{(1 + e^{\eta_j})^{M_j} } & = 2^{-M_j} e^{\kappa(y_j) \eta_j} \int_0^\infty e^{- \omega_j \eta_j^2 / 2 } [\omega_j | M_j, 0] \,d \omega_j. \end{align*}\]


  • \(\kappa(y_j) = y_j - M_j / 2\).

  • Applying this integral identity to each binomial term in the product representation of the multinomial likelihood introduces the latent variables \(\omega_j\).

  • The density \([\omega_j | M_j, 0]\) is a Pólya-gamma distribution.

Polson NG, Scott JG, Windle J (2013). “Bayesian inference for logistic models using Pólya-Gamma latent variables.” Journal of the American statistical Association.

Linderman S, Johnson M, Adams RP (2015). “Dependent multinomial models made easy: Stick-breaking with the Pólya-Gamma augmentation.” In Advances in Neural Information Processing Systems, 3456-3464.

Data model (ignoring spatio-temporal indexing)

  • With a prior \([\eta_j]\), joint density \([y_j, \eta_j]\) is

\[\begin{align*} [\eta_j, y_j] & = [\eta_j] {M_j \choose y_j} \frac{(e^{\eta_j})^{y_j}}{(1 + e^{\eta_j})^{M_j} }\\ & = \int_0^\infty [\eta_j] {M_j \choose y_j} 2^{-M_j} e^{\kappa(y_j) \eta_j} e^{- \omega_j \eta_j^2 / 2 } [\omega_j | M_j, 0] \,d \omega_j, \end{align*}\]


where the integral defines a joint density over \([\eta_j, y_j, \omega_j]\).

Data model

  • Using this integral representation, we have

    \[\begin{align*} \omega_j | \eta_j, y_j & \sim \operatorname{PG} \left( M_j, \eta_j \right), \end{align*}\]

    which can be sampled using the exponential tilting property of the Pólya-gamma distribution.


  • If \([\eta_j]\) is Gaussian, then \([\eta_j | \omega_j, y_j]\) is also Gaussian which enables conjugate sampling.

Data model

  • There is a cost:


    • Requires sampling the \(\omega_j\)s, which is computationally expensive.

    • However, these are also embarrassingly parallel.

    • Efficiently sampled using openMP parallelization.

  • For all examples tested so far with reduced-rank spatial processes, sampling Pólya-gamma random variables is the limiting computational cost.

Process model – functional relationship

\[\begin{align*} \eta_j(\mathbf{s}, t) = \color{blue}{\beta_{0j}} + \color{blue}{\beta_{1j}} \left( \mathbf{x}'(\mathbf{s}, t) \boldsymbol{\gamma} + \mathbf{w}'(\mathbf{s}) \boldsymbol{\alpha}(t) \right) + \color{blue}{\varepsilon(\mathbf{s}, t)}. \end{align*}\]


  • \(\color{blue}{\beta_{0j}}\) and \(\color{blue}{\beta_{1j}}\) are regression coefficients with respect to the climate state \(\mathbf{Z}(\mathbf{s}, t) = \left( \mathbf{x}'(\mathbf{s}, t) \boldsymbol{\gamma} + \mathbf{w}'(\mathbf{s}) \boldsymbol{\alpha}(t) \right)\).

  • \(\color{blue}{\varepsilon(\mathbf{s}, t)} \stackrel{iid}{\sim} \operatorname{N}(0, \sigma^2_j)\) models overdispersion relative to the linear regression response.

Process model – climate process

\[\begin{align*} \eta_j(\mathbf{s}, t) = \beta_{0j} + \beta_{1j} \left( \color{red}{\mathbf{x}'(\mathbf{s}, t) \boldsymbol{\gamma}} + \color{purple}{\mathbf{w}'(\mathbf{s}) \boldsymbol{\alpha}(t)} \right) + \varepsilon(\mathbf{s}, t). \end{align*}\]

\(\color{red}{\mathbf{X}(t) = \begin{pmatrix} \mathbf{x}'(\mathbf{s}_1, t) \\ \vdots \\ \mathbf{x}'(\mathbf{s}_n, t) \end{pmatrix}}\) are fixed covariates (elevation, latitude, etc.).

  • We assume \(\color{red}{\mathbf{X}(t) \equiv \mathbf{X}}\) for all \(t\) although temporally varying covariates are possible (volcanic forcings, Milankovitch cycles, etc.).

  • \(\color{purple}{\mathbf{W} = \begin{pmatrix} \mathbf{w}'(\mathbf{s}_1) \\ \vdots \\ \mathbf{w}'(\mathbf{s}_n) \end{pmatrix}}\) are spatial basis functions with temporal random effects \(\color{purple}{\boldsymbol{\alpha}(t)}\).

  • \(\mathbf{Z}_0 \sim \operatorname{N} (\color{red}{\mathbf{X}(1) \boldsymbol{\gamma}} + \color{purple}{\mathbf{W} \boldsymbol{\alpha}(1)}, \sigma^2_0 \mathbf{I})\) is the observed modern climate state.

Process model – \(\color{purple}{\mbox{random effects}}\)

  • Challenge: scaling to 1000s of sites and 15,000 years (\(\approx 60\) time increments).

  • Sparse, multiresolution representation with a dynamic linear model.

\[\begin{align*} \color{purple}{\mathbf{w}'(\mathbf{s}) \boldsymbol{\alpha}(t)} = \color{purple}{\sum_{m=1}^M \mathbf{w}_m'(\mathbf{s}) \boldsymbol{\alpha}_m(t)}. \end{align*}\]


  • \(\color{purple}{\mathbf{w}_m'(\mathbf{s})}\) is a Wendland basis over resolution \(m\).
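A sketch of a multiresolution Wendland basis in Python, using the compactly supported C\(^2\) Wendland function \((1-d)^4(4d+1)\); the grids of centers and their radii are invented for illustration:

```python
import numpy as np

def wendland(d):
    """Compactly supported Wendland (C^2) basis function:
    (1 - d)^4 (4 d + 1) for scaled distance d < 1, and 0 otherwise."""
    d = np.asarray(d)
    return np.where(d < 1.0, (1.0 - d) ** 4 * (4.0 * d + 1.0), 0.0)

# Multiresolution expansion: coarse-to-fine grids of centers, with the
# support radius halving at each resolution (as in multiresolution GP models).
s = np.linspace(0, 1, 200)
W_cols = []
for n_m, radius in [(5, 0.5), (10, 0.25), (20, 0.125)]:
    centers = np.linspace(0, 1, n_m)
    W_cols.append(wendland(np.abs(s[:, None] - centers[None, :]) / radius))
W = np.hstack(W_cols)        # n x (5 + 10 + 20) sparse-in-practice basis
assert W.shape == (200, 35)
assert (W >= 0).all()
```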

Process model – \(\color{purple}{\mbox{random effects}}\)



Nychka D, Bandyopadhyay S, Hammerling D, Lindgren F, Sain S (2015). “A multiresolution Gaussian process model for the analysis of large spatial datasets.” Journal of Computational and Graphical Statistics, 24(2), 579-599.

Process model – \(\color{purple}{\mbox{random effects}}\)

\[\begin{align*} \color{purple}{\boldsymbol{\alpha}_m(t)} & \sim \operatorname{N} \left( \mathbf{A}_m \color{purple}{\boldsymbol{\alpha}_m(t-1)}, \tau^2 \mathbf{Q}_m^{-1} \right). \end{align*}\]


  • Assume the constraint \(\color{purple}{(\mathbf{W}_m \boldsymbol{\alpha}_m(t))'} \mathbf{1} = 0\) for all \(m\) and \(t\).

  • \(\mathbf{A}_m\) is the evolution matrix (assume \(\mathbf{A}_m \equiv \rho \mathbf{I}\) for all \(m\)).

  • \(\mathbf{Q}_m\) is the precision matrix constructed from either a CAR or SAR process over the \(m\)th resolution.

Recursive Bayesian Inference for spatial and spatio-temporal models

First order vs. second order representation

  • Consider the Gaussian linear mixed model - first order (mean) representation:

\[ \begin{align*} \mathbf{y} & = \mathbf{X} \boldsymbol{\beta} + \mathbf{Z} \boldsymbol{\alpha} + \boldsymbol{\varepsilon}. \end{align*} \]

  • \(\mathbf{X}\) is a set of fixed effect covariates.

  • \(\boldsymbol{\beta}\) are regression coefficients.

  • \(\mathbf{Z}\) is a set of random effect covariates and/or basis functions.

  • \(\boldsymbol{\alpha} \sim \operatorname{N}(\mathbf{0}, \boldsymbol{\Sigma}_{\alpha})\) are random effects.

  • \(\boldsymbol{\varepsilon} \sim \operatorname{N}(\mathbf{0}, \sigma^2 \mathbf{I})\).

First order vs. second order representation

  • First-order (mean) representation:

\[ \begin{align*} \mathbf{y} & \sim \operatorname{N} \left(\mathbf{X} \boldsymbol{\beta} + \mathbf{Z} \boldsymbol{\alpha}, \sigma^2 \mathbf{I} \right), \\ \boldsymbol{\alpha} & \sim \operatorname{N} \left( \mathbf{0}, \boldsymbol{\Sigma}_{\alpha} \right), \\ \boldsymbol{\beta} & \sim \operatorname{N} \left( \mathbf{0}, \boldsymbol{\Sigma}_{\beta} \right). \end{align*} \]

  • Second order (covariance) representation:

\[ \begin{align*} [\mathbf{y}] & = \int \int \operatorname{N} \left(\mathbf{y} \,|\, \mathbf{X} \boldsymbol{\beta} + \mathbf{Z} \boldsymbol{\alpha}, \sigma^2 \mathbf{I} \right) \operatorname{N} \left( \boldsymbol{\alpha} \,|\, \mathbf{0}, \boldsymbol{\Sigma}_{\alpha} \right) \operatorname{N} \left( \boldsymbol{\beta} \,|\, \mathbf{0}, \boldsymbol{\Sigma}_{\beta} \right) \, \mathrm{d} \boldsymbol{\alpha} \, \mathrm{d} \boldsymbol{\beta} \\ \mathbf{y} & \sim \operatorname{N} \left(\mathbf{0}, \sigma^2 \mathbf{I} + \mathbf{Z} \boldsymbol{\Sigma}_{\alpha}\mathbf{Z'} + \mathbf{X} \boldsymbol{\Sigma}_\beta \mathbf{X}' \right). \end{align*} \]

First order vs. second order representation

  • For Gaussian models, this integration is standard practice – improved mixing, convergence, and efficiency.

  • For non-Gaussian models, brute force or approximations are used.

    • Sample the latent \(\boldsymbol{\alpha}\)s and \(\boldsymbol{\beta}\)s directly.

    • Integrated nested Laplace approximations (INLA).

    • Replace the non-Gaussian likelihood by an approximating distribution that is Gaussian – Variational Inference.

    • Conjugate recursive Bayesian inference.

Example: spatial regression model

\[ \begin{align*} y \left( \mathbf{s} \right) & = \mathbf{X} \left( \mathbf{s} \right) \boldsymbol{\beta} + \eta \left( \mathbf{s} \right) + \varepsilon \left( \mathbf{s} \right), \\ \boldsymbol{\eta} & \sim \operatorname{N} \left( \mathbf{0}, \mathbf{C} \right). \end{align*} \]

  • \(\boldsymbol{\eta} = \left( \eta \left( \mathbf{s}_1 \right), \ldots, \eta \left( \mathbf{s}_n \right) \right)'\).

  • Spatial location index \(\mathbf{s}\).


  • \(\mathbf{C}\) is a spatial covariance matrix with \(ij\)th entry \(\mathbf{C}_{ij} = \mathbf{C}\left( \mathbf{s}_i, \mathbf{s}_j \right)\).

Example: spatial regression model

  • Let \(\mathbf{Z}\) be a set of spatial basis functions where \(\mathbf{Z} \mathbf{Z}' = \mathbf{C}\).

\[ \begin{align*} y \left( \mathbf{s} \right) & = \mathbf{X} \left( \mathbf{s} \right) \boldsymbol{\beta} + \eta \left( \mathbf{s} \right) + \varepsilon \left( \mathbf{s} \right) \\ & = \mathbf{X} \left( \mathbf{s} \right) \boldsymbol{\beta} + \mathbf{Z} \left( \mathbf{s} \right) \boldsymbol{\alpha} + \varepsilon \left( \mathbf{s} \right). \end{align*} \]


  • \(\mathbf{Z}\) can be chosen to provide fast computation, reduced-rank computation, or other properties of interest if you use the approximation \(\mathbf{Z} \boldsymbol{\Sigma}_{\alpha} \mathbf{Z}' \approx \mathbf{C}\).

Example: spatial regression model

  • Let \(\mathbf{C} = \mathbf{U} \mathbf{D} \mathbf{U}'\) be an eigen-decomposition. Then setting \(\mathbf{Z}= \mathbf{U}\) and \(\boldsymbol{\alpha} \sim \operatorname{N} \left( \mathbf{0}, \mathbf{D} \right)\) gives an equivalent model

\[ \begin{align*} [\mathbf{y}] & = \int \operatorname{N} \left( \mathbf{y} \,|\, \mathbf{X} \boldsymbol{\beta} + \mathbf{U} \boldsymbol{\alpha}, \sigma^2 \mathbf{I} \right) \operatorname{N} \left( \boldsymbol{\alpha} \,|\, \mathbf{0}, \mathbf{D} \right) \, \mathrm{d} \boldsymbol{\alpha} \\ & = \operatorname{N} \left( \mathbf{X} \boldsymbol{\beta}, \sigma^2 \mathbf{I} + \mathbf{U} \mathbf{D} \mathbf{U}' \right) \\ & = \operatorname{N} \left( \mathbf{X} \boldsymbol{\beta}, \sigma^2 \mathbf{I} + \mathbf{C} \right). \end{align*} \]
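This equivalence is easy to confirm numerically. A sketch with an invented exponential covariance over random 1-d locations:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 40
s = rng.uniform(0, 5, size=(n, 1))
C = np.exp(-np.abs(s - s.T))          # exponential spatial covariance

# Eigen-decomposition C = U D U' gives basis Z = U with alpha ~ N(0, D).
D_vals, U = np.linalg.eigh(C)
assert np.allclose(U @ np.diag(D_vals) @ U.T, C)

# Marginal covariance of y matches the second-order representation.
sigma2 = 0.5
marginal = sigma2 * np.eye(n) + U @ np.diag(D_vals) @ U.T
assert np.allclose(marginal, sigma2 * np.eye(n) + C)
```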

Making Bayesian spatio-temporal generalized linear models second-order

Idea

  • Integrating out the latent mean structure improves model stability and results in better model fitting.

  • However, integrating out the random effect (usually) requires a Gaussian likelihood.

  • How can we perform integrated Bayesian inference to get the computational advantages of the second-order representation in non-Gaussian models?

Independent data

  • Single stage (joint) model posterior

\[ \begin{align*} [\mathbf{z}, \boldsymbol{\theta}_z, \boldsymbol{\theta}_y | \mathbf{y}] & \propto \left( \prod_{i=1}^N [y_i | z_i, \boldsymbol{\theta}_y] \right) [\mathbf{z} | \boldsymbol{\theta}_z] [\boldsymbol{\theta}_y] [\boldsymbol{\theta}_z]. \end{align*} \]

  • Specify full conditional distributions and sample.

  • Metropolis-Hastings, Gibbs, slice sampling, Hamiltonian Monte Carlo, etc.

Non-Gaussian spatial models

  • \(\mathbf{z}\) is a spatial random process.

    • \(\mathbf{z} \sim \operatorname{N} (\mathbf{0}, \boldsymbol{C})\).
  • Must sample the process model (random effect \(\mathbf{z}\)) directly – cannot integrate out the process variable.

  • Computational complexity grows as \(O(n^3)\).

  • Inefficient to sample from a complicated likelihood surface.

Alternative model

  • Fit the model in two stages.

  • Start with a simple first stage model.

  • Use output of first stage model as input into second stage model.

  • First stage model generates an importance sampling proposal distribution for the second stage.

  • If properly chosen, can integrate out the latent process \(\mathbf{z}\) to reduce computational cost.

First Stage

  • First, fit a simple model:

\[ \begin{align*} [\tilde{z}_i | y_i] & \propto [y_i | \tilde{z}_i] [\tilde{z}_i ]. \end{align*} \]

  • Assign each of the \(\tilde{z}_i\) independent transient priors \([\tilde{z}_i]\).

  • Posterior should be overdispersed relative to the target posterior.

  • Can choose \([\tilde{z}_i]\) to allow for analytic inference.

  • Generate \(k_1 = 1, \ldots, K_1\) first stage samples \(\tilde{z}_i^{(k_1)}\).

Second stage estimation–Metropolis-Hastings

  1. Current state of variable \(z\) after \(k-1\) MCMC iterations is \(z^{(k-1)}\).

  2. Propose \(z^{\star}\) from proposal distribution \([z^{\star}|z^{(k-1)}]\).

  3. Calculate MH ratio \(a^{\star} = \frac{[y|z^{\star}] [z^{\star}] [z^{(k-1)} | z^{\star}]}{[y|z^{(k-1)}] [z^{(k-1)}] [z^{\star} | z^{(k-1)}]}\).

  4. Set

\[ \begin{align*} z^{(k)} = \begin{cases} z^{\star} & \mbox{ with probability } \min(1, a^{\star}), \\ z^{(k-1)} & \mbox{otherwise}. \end{cases} \end{align*} \]
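For a symmetric random-walk proposal, the \([z^{\star} | z^{(k-1)}]\) terms cancel, giving the familiar update sketched below on an invented standard-normal target:

```python
import numpy as np

def mh_step(z, log_target, rng, step=2.0):
    """One Metropolis-Hastings update. The Gaussian random-walk proposal is
    symmetric, so the proposal densities cancel in the ratio a*."""
    z_star = z + step * rng.normal()
    log_a = log_target(z_star) - log_target(z)      # log of the MH ratio
    return z_star if np.log(rng.uniform()) < log_a else z

rng = np.random.default_rng(6)
log_target = lambda z: -0.5 * z ** 2                # standard normal target
z, draws = 0.0, []
for _ in range(5000):
    z = mh_step(z, log_target, rng)
    draws.append(z)
draws = np.asarray(draws)
# The chain should recover the target's mean (0) and standard deviation (1).
assert abs(draws.mean()) < 0.2
assert abs(draws.std() - 1.0) < 0.2
```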

Second stage estimation

  • Use the posterior samples \(\tilde{z}^{(k_1)}\) for \(k_1 = 1, \ldots, K_1\) from the first stage posterior distribution \([\tilde{z}_i | y_i]\) as a proposal for second stage estimation.

  • MH update for the \(k_2\)th second stage iteration:

  • Sample \(\tilde{z}_i^{\star}\) from the \(K_1\) first stage posterior samples from \([\tilde{z}_i | y_i ]\) (i.e., set \(\star\) to one of \(1, \ldots, K_1\)).

  • MH-ratio is \[ \begin{align} \label{eq:MH} a_i^{(k_2)} & = \frac{\color{blue}{[y_i | \tilde{z}_i^{\star}]} [\tilde{z}_i^{\star} | \tilde{\mathbf{z}}_{-i}, \boldsymbol{\theta}_z^{(k_2-1)}] \color{red}{[\tilde{z}^{(k_2-1)}_i | y_i]} }{\color{red}{[y_i | \tilde{z}^{(k_2-1)}_i]} [\tilde{z}^{(k_2 - 1)}_i | \tilde{\mathbf{z}}_{-i}, \boldsymbol{\theta}_z^{(k_2-1)}] \color{blue}{[\tilde{z}^{\star}_i | y_i]} }. \end{align} \]

Second Stage

  • Both \(\color{red}{[\tilde{z}^{(k_2-1)}_i | y_i]}\) and \(\color{blue}{[\tilde{z}^{\star}_i | y_i]}\) are samples from the first stage posterior:

  • Replace these densities by their un-normalized first stage densities \(\color{red}{[y_i | \tilde{z}^{(k_2-1)}_i]} \color{orange}{[\tilde{z}^{(k_2-1)}_i]}\) and \(\color{blue}{[y_i | \tilde{z}^{\star}_i]} \color{orange}{[\tilde{z}_i^{\star}]}\).

  • Then, the MH acceptance ratio becomes

\[ \begin{align} a_i^{(k_2)} & = \frac{\color{blue}{[y_i | \tilde{z}_i^{\star}]} [\tilde{z}_i^{\star} | \tilde{\mathbf{z}}_{-i}, \boldsymbol{\theta}_z^{(k_2-1)}] \color{red}{[y_i | \tilde{z}^{(k_2-1)}_i]} \color{orange}{[\tilde{z}^{(k_2-1)}_i]} } {\color{red}{[y_i | \tilde{z}^{(k_2-1)}_i]} [\tilde{z}^{(k_2 - 1)}_i | \tilde{\mathbf{z}}_{-i}, \boldsymbol{\theta}_z^{(k_2-1)}] \color{blue}{[y_i | \tilde{z}^{\star}_i]} \color{orange}{[\tilde{z}_i^{\star}]} } \\ \nonumber & = \frac{[\tilde{z}_i^{\star} | \tilde{\mathbf{z}}_{-i}, \boldsymbol{\theta}_z^{(k_2-1)}] \color{orange}{[\tilde{z}^{(k_2-1)}_i]} }{[\tilde{z}^{(k_2 - 1)}_i | \tilde{\mathbf{z}}_{-i}, \boldsymbol{\theta}_z^{(k_2-1)}] \color{orange}{[\tilde{z}_i^{\star}]} }. \end{align} \]

  • \(\color{orange}{[\tilde{z}_i^{\star}]}\) and \(\color{orange}{[\tilde{z}_i^{(k_2 - 1)}]}\) are the first-stage, transient prior distributions.

Second Stage

  • The ratio \(\frac{\color{blue}{[y_i | \tilde{z}_i^{\star}]} \color{red}{[y_i | \tilde{z}^{(k_2-1)}_i]} }{ \color{red}{[y_i | \tilde{z}^{(k_2-1)}_i]} \color{blue}{[y_i | \tilde{z}^{\star}_i]} } = 1\), so these likelihood terms drop out of the MH ratio.

  • Notice the following doesn’t use the data:

\[ \begin{align} a_i^{(k_2)} & = \frac{[\tilde{z}_i^{\star} | \tilde{\mathbf{z}}_{-i}, \boldsymbol{\theta}_z^{(k_2-1)}] [\tilde{z}^{(k_2-1)}_i] }{[\tilde{z}^{(k_2 - 1)}_i | \tilde{\mathbf{z}}_{-i}, \boldsymbol{\theta}_z^{(k_2-1)}] [\tilde{z}_i^{\star}] }. \end{align} \]

  • If the data model \([y_i|z_i]\) is non-Gaussian but \([\mathbf{z}_i|\boldsymbol{\theta}_z]\) is Gaussian, we can integrate out the random effect.

  • Allows for fast, efficient sampling.

Second Stage – Importance Sampling

  • Drawback of MH: many of the values proposed from \([\tilde{z}_i | y_i ]\) will have low acceptance probability.

  • Importance sample over all of the first stage samples \([\tilde{z}_i | y_i ]\).

  • Un-normalized importance weights:

\[ \begin{align*} w_i^{(k_1, k_2)} & = \frac{[\tilde{z}_i^{(k_1)} | \tilde{\mathbf{z}}_{-i}, \boldsymbol{\theta}_z^{(k_2-1)}]}{[\tilde{z}^{(k_1)}_i]}. \end{align*} \]

Second Stage – Importance Sampling

  • Each of these weights requires evaluation of the density \([z_i^{(k_1)} | \mathbf{z}_{-i}, \boldsymbol{\theta}_z^{(k_2-1)}]\) which can be computationally costly and is evaluated \(N\) times per MCMC sample.

  • Most of the first-stage samples have low posterior density under the second-stage model.

  • Wasted computational effort to evaluate weights for unlikely samples.

  • Subsampled importance samples:

    • Assign a latent variable \(\delta_i^{(k_1)}\) and only evaluate the importance weights when \(\delta_i^{(k_1)} = 1\).

    • Embarrassingly parallel update.

Second Stage – Alternative options

  • Instead of using the first stage posterior samples from \([\tilde{z}(\mathbf{s}_i) | y(\mathbf{s}_i)]\) as an MH proposal, apply a density estimation method (e.g., kernel density estimation, Dirichlet process estimation, etc.) and use the estimated density to generate proposals.

  • This can be done relatively easily using either an accept-reject algorithm or an importance sampler.

Example: Spatial GLM

  • In the Bayesian spatial GLM, the model is of the form

\[ \begin{align*} y \left( \mathbf{s}_i \right) & = g^{-1} \left( z \left( \mathbf{s}_i \right) \right), \\ z \left( \mathbf{s}_i \right) & = \mathbf{X} \left( \mathbf{s}_i \right) \boldsymbol{\beta} + \eta \left( \mathbf{s}_i \right). \end{align*} \]

  • \(g\left( \cdot \right)\) is an appropriately chosen link function.

  • \(\mathbf{X} \left( \mathbf{s}_i \right)\) is a set of fixed effect covariates.

  • \(\boldsymbol{\beta}\) are regression coefficients.

  • \(\boldsymbol{\eta} = \left( \eta \left(\mathbf{s}_1 \right), \ldots, \eta \left( \mathbf{s}_N \right) \right)' \sim \operatorname{N} (\mathbf{0}, \mathbf{C})\) is a random spatial process.

Bayesian spatial GLMs

  • In general, sampling a high-dimensional latent effect like \(\boldsymbol{\eta}\) in BHMs is a challenging and unsolved problem.

  • Typical solution for the Gaussian case – integrate out the latent variable and model the covariance.

  • Not possible in non-Gaussian models (GLMs).

  • In the two-stage model, the data distribution \([y(\mathbf{s}_i) | z(\mathbf{s}_i), \boldsymbol{\theta}_y]\) does not appear in the Metropolis ratio.

    • In the second stage posterior, one can directly integrate out the latent random variable \(\boldsymbol{\eta}\); there is no requirement to sample a high-dimensional random variable, which improves MCMC mixing.

Bayesian spatial Poisson regression

  • Let \(\mathbf{s}_i\) represent a spatial location and \(y(\mathbf{s}_i)\) be the observation.

\[ \begin{align*} y(\mathbf{s}_i) | \lambda(\mathbf{s}_i) & \stackrel{iid}{\sim} \operatorname{Poisson}\left( \lambda \left( \mathbf{s}_i \right) \right), \end{align*} \]

where the latent intensity \(\boldsymbol{\lambda} = \left( \lambda \left( \mathbf{s}_1 \right), \ldots, \lambda \left(\mathbf{s}_n \right) \right)'\) is modeled using a log-link function with the distribution

\[ \begin{align*} \log(\boldsymbol{\lambda}) & = \mathbf{X} \boldsymbol{\beta} + \boldsymbol{\eta}, \\ \boldsymbol{\eta} | \tau^2, \boldsymbol{\theta} & \sim \operatorname{N} \left( \mathbf{0}, \tau^2 \mathbf{R} \left( \boldsymbol{\theta} \right) \right), \end{align*} \] where \(\tau^2\) is a variance parameter and \(\boldsymbol{\theta}\) is a set of parameters for the spatial correlation matrix \(\mathbf{R} \left( \boldsymbol{\theta} \right)\).

  • Note: \(\boldsymbol{\eta}\) is a latent random vector of dimension \(n\) which cannot be analytically integrated out of the model.

Spatial Poisson Regression - First stage

\[ \begin{align*} y(\mathbf{s}_i) | \tilde{\lambda}(\mathbf{s}_i) & \stackrel{ind}{\sim} \operatorname{Poisson}\left( \tilde{\lambda} \left( \mathbf{s}_i \right) \right), \end{align*} \]

where

\[ \begin{align*} \tilde{\lambda}(\mathbf{s}_i) & \sim \operatorname{Gamma} \left( \epsilon_\lambda, \epsilon_\lambda \right). \end{align*} \]

  • For the second stage estimates to have finite variance, the first stage posterior distribution \(\left[ \tilde{\lambda} \left( \mathbf{s}_i \right) | \cdot \right]\) should have heavier tails than the target posterior distribution \(\left[ \lambda \left( \mathbf{s}_i \right) | \cdot \right]\).

  • \(\tilde{\lambda} \left( \mathbf{s}_i \right) \sim \operatorname{Gamma}(\epsilon_\lambda, \epsilon_\lambda)\) for some small \(\epsilon_\lambda\) will produce a diffuse, analytic first-stage posterior.

    • The analytic solution is beneficial because it allows for fast, efficient simulation of the first stage posterior.
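Because the Gamma prior is conjugate to the Poisson likelihood, the first-stage posterior is \(\tilde{\lambda}(\mathbf{s}_i) | y(\mathbf{s}_i) \sim \operatorname{Gamma}\left(\epsilon_\lambda + y(\mathbf{s}_i),\, \epsilon_\lambda + 1\right)\) (shape/rate parameterization), so it can be sampled directly and independently across sites. A minimal Python sketch (the function name and default \(\epsilon_\lambda\) are illustrative):

```python
import numpy as np

def first_stage_draws(y, eps=0.01, n_draws=5000, rng=None):
    """Draw from the analytic first-stage posterior
    lambda_tilde_i | y_i ~ Gamma(shape = eps + y_i, rate = eps + 1),
    independently for each site i."""
    rng = np.random.default_rng(rng)
    y = np.asarray(y)
    # numpy parameterizes the Gamma by shape and scale = 1 / rate
    return rng.gamma(shape=eps + y[None, :], scale=1.0 / (eps + 1.0),
                     size=(n_draws, y.size))

# counts at three sites, again purely illustrative
draws = first_stage_draws([3, 0, 7], rng=0)
```

These saved draws form the pool of independent proposals used in the second-stage Metropolis-Hastings update.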

Spatial Poisson Regression - Second stage

  • Metropolis-Hastings update for \(\lambda(\mathbf{s}_i)\) where the proposal distribution \(\lambda^{\star}(\mathbf{s}_i) \sim [\lambda(\mathbf{s}_i) | \mathbf{y}]\) is an independent, discrete sample from the first stage posterior with acceptance probability \(\min(1, a_i^{\star})\) where

\[ \begin{align} \label{eq:MH-poisson} a_i^{\star} & = \frac{[y(\mathbf{s}_i) | \lambda^{\star}(\mathbf{s}_i)] [\lambda^{\star} ( \mathbf{s}_i) | \lambda(\mathbf{s}_{-i}), \boldsymbol{\eta}^{(k_2-1)}, \boldsymbol{\beta}^{(k_2-1)}] [\lambda^{(k_2-1)}(\mathbf{s}_i) | y(\mathbf{s}_i)] }{[y(\mathbf{s}_i) | \lambda^{(k_2-1)}(\mathbf{s}_i)] [\lambda^{(k_2-1)} ( \mathbf{s}_i) | \lambda(\mathbf{s}_{-i}), \boldsymbol{\eta}^{(k_2-1)}, \boldsymbol{\beta}^{(k_2-1)}] [\lambda^{\star}(\mathbf{s}_i) | y(\mathbf{s}_i)] }, \\ \nonumber & = \frac{ \left[\lambda^{\star} ( \mathbf{s}_i) | \lambda(\mathbf{s}_{-i}), \boldsymbol{\eta}^{(k_2-1)}, \boldsymbol{\beta}^{(k_2-1)} \right] \left[ \tilde{\lambda}^{(k_2-1)} \left( \mathbf{s}_i \right) \right] }{\left[\lambda^{(k_2-1)} ( \mathbf{s}_i) | \lambda(\mathbf{s}_{-i}), \boldsymbol{\eta}^{(k_2-1)}, \boldsymbol{\beta}^{(k_2-1)} \right] \left[ \tilde{\lambda}^{\star} \left( \mathbf{s}_i \right) \right]}. \end{align} \]

  • Requires samples of the latent \(n\)-dimensional vector \(\boldsymbol{\eta}\).

Spatial Poisson Regression - Second stage

However, the latent parameter \(\boldsymbol{\eta}\) in

\[ \begin{align} \left[\lambda^{\star} ( \mathbf{s}_i) | \lambda(\mathbf{s}_{-i}), \boldsymbol{\eta}^{(k_2-1)}, \boldsymbol{\beta}^{(k_2-1)} \right] \end{align} \]

can be integrated out of the model so that

\[ \begin{align*} \log \left( \lambda(\mathbf{s}_i) \right) | \log \left( \boldsymbol{\lambda}(\mathbf{s}_{-i}) \right) & \sim \operatorname{N} \left( \bar{\mu}_i, \bar{\sigma}^2_{i} \right), \end{align*} \]

where \(\boldsymbol{\lambda}(\mathbf{s}_{-i})\) is the vector of all of the \(\lambda(\mathbf{s}_j)\)s with \(j \neq i\) and

\[ \begin{align*} \bar{\mu}_i & = \mathbf{X}_i \boldsymbol{\beta} + \boldsymbol{\Sigma}_{i, -i} \boldsymbol{\Sigma}^{-1}_{-i, -i} \left( \log \left( \boldsymbol{\lambda} (\mathbf{s}_{-i}) \right) - \mathbf{X}_{-i} \boldsymbol{\beta} \right), \\ \bar{\sigma}^2_i & = \boldsymbol{\Sigma}_{i, i} - \boldsymbol{\Sigma}_{i, -i} \boldsymbol{\Sigma}^{-1}_{-i, -i} \boldsymbol{\Sigma}_{-i, i}, \end{align*} \]

with \(\boldsymbol{\Sigma}_{i, i} = \operatorname{Cov} \left( \log \left( \lambda(\mathbf{s}_i) \right), \log \left( \lambda(\mathbf{s}_i) \right) \right)\) and \(\boldsymbol{\Sigma}_{i, -i} = \operatorname{Cov} \left( \log \left( \lambda(\mathbf{s}_i) \right), \log \left( \boldsymbol{\lambda}(\mathbf{s}_{-i}) \right) \right)\).
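Putting the collapsed conditional and the independence proposal together, a single-site second-stage update might look like the Python sketch below. All names, the \(\epsilon\) value, and the demo data are illustrative assumptions; the conditional prior is lognormal in \(\lambda(\mathbf{s}_i)\) (hence the Jacobian terms \(\pm \log \lambda\)), and the first-stage \(\operatorname{Gamma}(\epsilon, \epsilon)\) prior enters through the simplified ratio above.

```python
import numpy as np
from scipy.stats import norm, gamma

def second_stage_update_i(i, log_lam, first_stage_pool, X, beta, Sigma,
                          eps=0.01, rng=None):
    """One recursive-MH update for lambda(s_i): propose an independent draw
    from the saved first-stage posterior pool and accept with the simplified
    ratio (collapsed conditional-prior ratio times the inverse ratio of the
    Gamma(eps, eps) first-stage prior)."""
    rng = np.random.default_rng(rng)
    idx = np.arange(len(log_lam)) != i

    # conditional N(mu_bar, sigma2_bar) of log lambda(s_i) given the rest,
    # with eta integrated out (Sigma = Cov(log lambda) = tau2 * R(theta))
    S12 = Sigma[i, idx]
    S22_inv = np.linalg.inv(Sigma[np.ix_(idx, idx)])
    mu_bar = X[i] @ beta + S12 @ S22_inv @ (log_lam[idx] - X[idx] @ beta)
    sigma2_bar = Sigma[i, i] - S12 @ S22_inv @ S12

    lam_star = rng.choice(first_stage_pool[:, i])   # independence proposal
    lam_old = np.exp(log_lam[i])

    # log Metropolis ratio: lognormal conditional-prior ratio times the
    # inverse ratio of the Gamma(eps, eps) first-stage prior densities
    sd = np.sqrt(sigma2_bar)
    log_a = (norm.logpdf(np.log(lam_star), mu_bar, sd) - np.log(lam_star)
             - norm.logpdf(log_lam[i], mu_bar, sd) + log_lam[i]
             + gamma.logpdf(lam_old, eps, scale=1 / eps)
             - gamma.logpdf(lam_star, eps, scale=1 / eps))
    if np.log(rng.uniform()) < min(0.0, log_a):
        log_lam[i] = np.log(lam_star)
    return log_lam

# tiny demo with made-up values
rng = np.random.default_rng(0)
n = 5
s = rng.uniform(size=(n, 2))
D = np.linalg.norm(s[:, None] - s[None], axis=-1)
Sigma = 0.5 * np.exp(-D / 0.3)
X = np.ones((n, 1))
beta = np.array([0.0])
y = np.array([2, 3, 1, 4, 2])
pool = rng.gamma(0.01 + y, 1 / 1.01, size=(1000, n))   # saved first-stage draws
log_lam = second_stage_update_i(2, np.log(pool[0]), pool, X, beta, Sigma)
```

Because the proposal is drawn from the saved first-stage posterior, the data likelihood cancels from the ratio, and only the cheap conditional-prior and Gamma-prior evaluations remain.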

Spatial Simulation study

  • Spatial Poisson and logistic regression.

  • Full-rank spatial process with sample sizes of 50, 100, 200, 300, 400, 500, 750, and 1000.

    • Note: full-rank spatial models typically become computationally intractable at roughly 5,000-10,000 samples.

    • Computational cost scales as \(O(n^3)\) and memory as \(O(n^2)\).

  • For real-world problems, use reduced-rank ideas like those presented earlier to scale to massive datasets.

Simulation study

Simulation study - Posteriors for parameters

Simulation study: \(\boldsymbol{\beta}\) estimates

Thanks for your attention

References

  • Banerjee S, Gelfand AE, Finley AO, Sang H (2008). “Gaussian predictive process models for large spatial data sets.” Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70(4), 825-848.
  • Murray I, Adams RP, MacKay DJC (2010). “Elliptical slice sampling.” In AISTATS, volume 13, 541-548.
  • Polson NG, Scott JG, Windle J (2013). “Bayesian inference for logistic models using Pólya-Gamma latent variables.” Journal of the American Statistical Association.
  • Holmström L, Ilvonen L, Seppä H, Veski S (2015). “A Bayesian spatiotemporal model for reconstructing climate from multiple pollen records.” The Annals of Applied Statistics, 9(3), 1194-1225.
  • Linderman S, Johnson M, Adams RP (2015). “Dependent multinomial models made easy: Stick-breaking with the Pólya-Gamma augmentation.” In Advances in Neural Information Processing Systems, 3456-3464.
  • Nychka D, Bandyopadhyay S, Hammerling D, Lindgren F, Sain S (2015). “A multiresolution Gaussian process model for the analysis of large spatial datasets.” Journal of Computational and Graphical Statistics, 24(2), 579-599.
  • Hefley TJ, Broms KM, Brost BM, Buderman FE, Kay SL, Scharf HR, Tipton JR, Williams PJ, Hooten MB (2017). “The basis function approach for modeling autocorrelation in ecological data.” Ecology, 98(3), 632-646.
  • Hooten MB, Johnson DS, Brost BM (2019). “Making Recursive Bayesian Inference Accessible.” The American Statistician, 1-10.
  • Nolan C, Tipton J, Booth RK, Hooten MB, Jackson ST (2019). “Comparing and improving methods for reconstructing peatland water-table depth from testate amoebae.” The Holocene, 29(8), 1350-1361.
  • Tipton JR, Hooten MB, Nolan C, Booth RK, McLachlan J (2019). “Predicting paleoclimate from compositional data using multivariate Gaussian process inverse prediction.” The Annals of Applied Statistics, 13(4), 2363-2388.